
Main #5

Merged
jgibson2 merged 18 commits into polycam from main on Apr 24, 2026

Conversation

@jgibson2
Collaborator

Summary


Test plan


xingguo01 and others added 18 commits April 23, 2026 18:08
…18767)

- Add support for quantized clamp-type activations in the Cortex-M
pipeline by canonicalizing relu/hardtanh/clamp to quantized
aten.clamp.default for standalone int8 paths
- Extend activation fusion to cover max_pool2d.
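Conceptually, the canonicalization can be sketched as a small rewrite table (hypothetical helper names; the real pass operates on quantized graph nodes):

```python
# Hedged sketch of clamp-type canonicalization (hypothetical names).
# relu(x) == clamp(x, min=0) and hardtanh(x, lo, hi) == clamp(x, lo, hi),
# so a single quantized clamp fusion path can serve all three activations.
def canonicalize_activation(op_name, args):
    if op_name == "aten.relu.default":
        (x,) = args
        return "aten.clamp.default", (x, 0.0, None)
    if op_name == "aten.hardtanh.default":
        x, lo, hi = args
        return "aten.clamp.default", (x, lo, hi)
    return op_name, args  # aten.clamp.default and other ops pass through
```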

@freddan80 @per @zingo @oscarandersson8218 @digantdesai
@Sebastian-Larsson @AdrianLundell @psiddh

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell

Signed-off-by: Xingguo Li <xingguo.li@arm.com>
…ch#18971)

FuseConstantArgsPass resolved input_qparams by flattened input-node
index, while FoldAndAnnotateQParamsPass stores them by top-level
argument index. For aten.cat with a list-valued tensor argument, this
caused only the first tensor to be dequantized before folding, which
corrupted the fused constant.

Resolve qparams by top-level argument index and propagate that qparam
through nested list and tuple arguments. Add a regression test for
quantized aten.cat constant folding with list-valued tensor inputs.
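The indexing fix can be illustrated with a minimal sketch (names and structures are hypothetical; the real passes work on FX nodes): qparams are keyed by the top-level argument index and then applied through nested list/tuple arguments, which is what a list-valued input like aten.cat's requires.

```python
# Hedged sketch: resolve qparams by top-level argument index, then
# propagate them through nested list/tuple arguments.
def map_arg(arg, fn):
    if isinstance(arg, (list, tuple)):
        return type(arg)(map_arg(a, fn) for a in arg)
    return fn(arg)

def dequantize_inputs(args, input_qparams, dequant):
    out = []
    for idx, arg in enumerate(args):
        qp = input_qparams.get(idx)  # keyed by top-level index, not flattened
        if qp is None:
            out.append(arg)
        else:
            out.append(map_arg(arg, lambda t: dequant(t, qp)))
    return out
```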

Signed-off-by: Per Held <per.held@arm.com>
Change-Id: I6e1a012d82a5dbeecb403c440a2944953dd5cba7
Fixes pytorch#10736

Formats `third-party/CMakeLists.txt` using `cmake-format` to improve
readability and consistency.

**Changes:**
- Reformatted `ExternalProject_Add(...)` blocks for `flatbuffers` and
`flatcc`
- Reflowed `set_target_properties(...)`, `set(...)` cache variables, and
`install(...)` calls
- No functional changes — formatting only
All 4 tests failed because they called forward() with zero arguments on
mobilenet_v2 which expects a [1,3,224,224] float input. This was a test
bug, not a runtime bug. Add a dummyInput() helper that creates a
Tensor.ones with the correct shape, and remove all @ignore annotations.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Differential Revision: D101887672

Pull Request resolved: pytorch#19035
Differential Revision: D102189156

Pull Request resolved: pytorch#19077
…rch#19092)

Add cause-chaining constructor to ExecutorchRuntimeException so wrapped
exceptions preserve the original cause in the stack trace.

Restore detailed native error messages in LlmModule.load() — the null
runner case now reports the model_type_category and valid values instead
of a generic message. Load failures now throw from JNI with the specific
error code and description.

This commit was authored with the help of Claude.
…ytorch#18959)

Summary:

The CUDA runtime shims for sort operations use the Half (float16) dtype, but it was not defined in the slim ScalarType enum, causing compiler warnings treated as errors (-Werror=switch). This adds proper Half support to the slim ScalarType enum so switch statements can use the enum value directly instead of casting to the underlying type.

Differential Revision: D101218928
1. Attacker sets that flag on an external tensor.
2. XNNPACK thinks it owns the tensor and frees it inside the backend.
3. The ET runtime also frees it at method destruction.
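The defensive check implied by the log output below can be sketched as follows (the mask value and function name are assumptions for illustration; the real check lives in XNNCompiler.cpp):

```python
# Hedged sketch: reject serialized tensor flags outside a supported mask
# instead of trusting attacker-controlled flag bits.
SUPPORTED_FLAG_MASK = 0x000000FF  # illustrative assumption

def validate_tensor_flags(flags):
    unsupported = flags & ~SUPPORTED_FLAG_MASK
    if unsupported:
        return f"Tensor value has unsupported flag bits 0x{unsupported:08x}"
    return None  # flags are within the supported set
```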


Test Plan:
Build and run executor runner against problematic PTE file:
```
# Build executor runner:
cmake -B cmake-out \
    -DEXECUTORCH_BUILD_EXECUTOR_RUNNER=ON \
    -DEXECUTORCH_BUILD_XNNPACK=ON
cmake --build cmake-out -j16 --target executor_runner

# Output
(executorch) [lfq@devvm11764.nha0 /data/users/lfq/security/executorch (f9f29e7)]$ ./cmake-out/executor_runner --model_path=/data/users/lfq/security/executorch_repros/TOB-EXECUTORCH-44.pte   
```
Previous
```
(executorch) [lfq@devvm11764.nha0 /data/users/lfq/security/executorch (security44)]$ ./cmake-out/executor_runner --model_path=/data/users/lfq/security/executorch_repros/TOB-EXECUTORCH-44.pte     
Note (XNNPACK): l1_data_cache_bytes=32768, l1_data_cache_line_size=64, l1_data_cache_associativity=8, l1_data_cache_num_sets=64. (init_hardware_config, /data/users/lfq/security/executorch/backends/xnnpack/third-party/XNNPACK/src/configs/hardware-config.c:417)
Note (XNNPACK): l2_data_cache_bytes=1048576, l2_data_cache_line_size=64, l2_data_cache_associativity=8, l2_data_cache_num_sets=2048. (init_hardware_config, /data/users/lfq/security/executorch/backends/xnnpack/third-party/XNNPACK/src/configs/hardware-config.c:436)
I 00:00:00.002612 executorch:cpuinfo_utils.cpp:71] Reading file /sys/devices/soc0/image_version
I 00:00:00.002640 executorch:cpuinfo_utils.cpp:87] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.002657 executorch:cpuinfo_utils.cpp:100] Reading file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.002664 executorch:cpuinfo_utils.cpp:109] Failed to open midr file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.002671 executorch:cpuinfo_utils.cpp:125] CPU info and manual query on # of cpus dont match.
I 00:00:00.002672 executorch:executor_runner.cpp:223] Resetting threadpool with num threads = 0
I 00:00:00.002722 executorch:executor_runner.cpp:374] Model file /data/users/lfq/security/executorch_repros/TOB-EXECUTORCH-44.pte is loaded.
I 00:00:00.002729 executorch:executor_runner.cpp:384] Using method forward
I 00:00:00.002739 executorch:executor_runner.cpp:435] Setting up planned buffer 0, size 112.
E 00:00:00.002806 executorch:XNNCompiler.cpp:331] Tensor value has unsupported flag bits 0xffffff00
E 00:00:00.002824 executorch:XNNPACKBackend.cpp:122] XNNCompiler::compileModel failed: 0x23
E 00:00:00.002827 executorch:method.cpp:127] Init failed for backend XnnpackBackend: 0x23
F 00:00:00.002830 executorch:executor_runner.cpp:459] In function main(), assert failed (method.ok()): Loading of method forward failed with status 0x23
Aborted (core dumped)
```

After: graceful error
```
(executorch) [lfq@devvm11764.nha0 /data/users/lfq/security/executorch (security44)]$ ./cmake-out/executor_runner --model_path=/data/users/lfq/security/executorch_repros/TOB-EXECUTORCH-44.pte     
Note (XNNPACK): l1_data_cache_bytes=32768, l1_data_cache_line_size=64, l1_data_cache_associativity=8, l1_data_cache_num_sets=64. (init_hardware_config, /data/users/lfq/security/executorch/backends/xnnpack/third-party/XNNPACK/src/configs/hardware-config.c:417)
Note (XNNPACK): l2_data_cache_bytes=1048576, l2_data_cache_line_size=64, l2_data_cache_associativity=8, l2_data_cache_num_sets=2048. (init_hardware_config, /data/users/lfq/security/executorch/backends/xnnpack/third-party/XNNPACK/src/configs/hardware-config.c:436)
I 00:00:00.002562 executorch:cpuinfo_utils.cpp:71] Reading file /sys/devices/soc0/image_version
I 00:00:00.002595 executorch:cpuinfo_utils.cpp:87] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.002607 executorch:cpuinfo_utils.cpp:100] Reading file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.002618 executorch:cpuinfo_utils.cpp:109] Failed to open midr file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.002623 executorch:cpuinfo_utils.cpp:125] CPU info and manual query on # of cpus dont match.
I 00:00:00.002628 executorch:executor_runner.cpp:223] Resetting threadpool with num threads = 0
I 00:00:00.002672 executorch:executor_runner.cpp:374] Model file /data/users/lfq/security/executorch_repros/TOB-EXECUTORCH-44.pte is loaded.
I 00:00:00.002678 executorch:executor_runner.cpp:384] Using method forward
I 00:00:00.002688 executorch:executor_runner.cpp:435] Setting up planned buffer 0, size 112.
E 00:00:00.002750 executorch:XNNCompiler.cpp:331] Tensor value has unsupported flag bits 0xffffff00
E 00:00:00.002761 executorch:XNNPACKBackend.cpp:122] XNNCompiler::compileModel failed: 0x23
E 00:00:00.002769 executorch:method.cpp:127] Init failed for backend XnnpackBackend: 0x23
F 00:00:00.002772 executorch:executor_runner.cpp:459] In function main(), assert failed (method.ok()): Loading of method forward failed with status 0x23
```

Co-authored-by: Github Executorch <github_executorch@arm.com>
Co-authored-by: Claude <noreply@anthropic.com>
…M (v1) (pytorch#18859)

The original SmolLM2 PR (pytorch#9354) started as v1 support, was renamed to
`smollm2` during review, but the repo ID and `rope_theta` were never
updated to v2 values. The two checkpoints are genuinely different models
(0/272 tensors match).

- `HUGGING_FACE_REPO_IDS["smollm2"]`: `HuggingFaceTB/SmolLM-135M` →
`HuggingFaceTB/SmolLM2-135M`
- `examples/models/smollm2/135M_config.json`: `rope_theta` `10000.0` →
`100000.0` (matches [SmolLM2-135M HF
config](https://huggingface.co/HuggingFaceTB/SmolLM2-135M/blob/main/config.json))

### Test plan

Data-only change (one string, one number). Verified values match the
upstream HuggingFace SmolLM2-135M config.
Add tryTo accessors for each value type. Previously, `toTensor` and its siblings abort via ET_CHECK_MSG on a type mismatch.

API additions:
- Per-type: tryToInt, tryToDouble, tryToBool, tryToScalar, tryToString,
  tryToTensor (already present, kept), tryToIntList, tryToBoolList,
  tryToDoubleList, tryToTensorList, tryToListOptionalTensor,
  tryToScalarType, tryToMemoryFormat, tryToLayout, tryToDevice.
  Tag mismatch returns Error::InvalidType; null list/string payload
  returns Error::InvalidState.
- Templated tryTo<T>() dispatcher mirroring to<T>(), via a new
  EVALUE_DEFINE_TRY_TO macro kept adjacent to EVALUE_DEFINE_TO so drift
  between the two surfaces is visible at review time.
- tryToOptional<T>() widened from Tensor-only to generic, delegating
  to tryTo<T>() so it works for any supported payload type.

Tests cover success + mismatch paths for each new accessor, plus the
widened tryToOptional<T>() path.
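A Python analogue of the pattern (hypothetical types; the real API is the C++ EValue) shows the contrast between the aborting and recoverable accessors:

```python
# Hedged sketch: a tagged value whose "try" accessor reports an error
# instead of aborting on a tag mismatch, mirroring toInt vs tryToInt.
from dataclasses import dataclass

@dataclass
class EValue:
    tag: str
    payload: object

    def to_int(self):
        # analogous to toInt(): fatal on mismatch
        assert self.tag == "Int", "fatal: tag mismatch"
        return self.payload

    def try_to_int(self):
        # analogous to tryToInt(): recoverable errors
        if self.tag != "Int":
            return None, "InvalidType"
        if self.payload is None:
            return None, "InvalidState"
        return self.payload, None
```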

Authored-with: Claude

---------

Co-authored-by: Github Executorch <github_executorch@arm.com>
…rity (pytorch#18917)

Differential Revision: D99769848

Pull Request resolved: pytorch#18917
…rch#19095)

This PR makes GPU-related operators CUDA-backend specific, to bring the Metal Qwen 3.5 MoE CI back.
Disable fusing of ops that have symbolic shapes as arguments. Also
disable fusing of TOSA dialect ops.

cc @digantdesai @freddan80 @per @zingo @mansnils @Sebastian-Larsson
@robell

Signed-off-by: Oscar Andersson <oscar.andersson@arm.com>
Adds util for computing a value range from a symbolic expression.
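The idea can be sketched with simple interval arithmetic over an expression tree (the representation here is hypothetical; the real util works on symbolic shape expressions):

```python
# Hedged sketch: propagate [min, max] bounds through a small expression
# tree, analogous to deriving a value range for a symbolic expression.
def value_range(expr, bounds):
    """expr: nested tuples, e.g. ("add", ("sym", "x"), ("const", 1))."""
    kind = expr[0]
    if kind == "const":
        return (expr[1], expr[1])
    if kind == "sym":
        return bounds[expr[1]]
    lo_a, hi_a = value_range(expr[1], bounds)
    lo_b, hi_b = value_range(expr[2], bounds)
    if kind == "add":
        return (lo_a + lo_b, hi_a + hi_b)
    if kind == "mul":
        # signs may flip the extremes, so take min/max over all products
        products = [lo_a * lo_b, lo_a * hi_b, hi_a * lo_b, hi_a * hi_b]
        return (min(products), max(products))
    raise ValueError(f"unsupported op: {kind}")
```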

cc @digantdesai @freddan80 @per @zingo @mansnils @Sebastian-Larsson
@robell

Signed-off-by: Oscar Andersson <oscar.andersson@arm.com>
The removed copy appears to be stale; it is never used.
…ch#18088)

## Summary

This PR adds a fused `llama::recurrent_gated_delta_rule` custom op and
wires Qwen3.5 GatedDeltaNet attention to use it instead of the Python
per-token recurrence loop when the op is available.

It also tightens local custom-op loading so we no longer implicitly scan
repo-local `cmake-out*` directories, and adds coverage for
recurrent-state correctness, chunked prefill behavior, and export graph
selection.
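For orientation only, the general shape of a per-token gated delta-rule recurrence the fused op replaces can be sketched as below. This is an illustrative toy, not the semantics of `llama::recurrent_gated_delta_rule`: a rank-1 state update per token, decayed by a gate.

```python
# Illustrative sketch only (not the actual op): per-token recurrence
# over a dim x dim state, updated with a gated delta-rule step.
def gated_delta_rule(keys, values, betas, gates, dim):
    S = [[0.0] * dim for _ in range(dim)]  # recurrent state
    outputs = []
    for k, v, beta, g in zip(keys, values, betas, gates):
        # current prediction of v from the state: S @ k
        pred = [sum(S[i][j] * k[j] for j in range(dim)) for i in range(dim)]
        # gated decay plus rank-1 delta correction toward v
        for i in range(dim):
            for j in range(dim):
                S[i][j] = g * S[i][j] + beta * (v[i] - pred[i]) * k[j]
        outputs.append([sum(S[i][j] * k[j] for j in range(dim)) for i in range(dim)])
    return outputs
```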

## What changed

- added `llama::recurrent_gated_delta_rule` runtime and AOT
registrations
- updated Qwen3.5 GatedDeltaNet attention to use the fused op with
Python fallback preserved
- tightened `custom_ops_aot_lib` discovery:
  - default to package-local discovery
  - allow explicit override via `EXECUTORCH_CUSTOM_OPS_AOT_LIB`
  - removed implicit repo-local `cmake-out*` scanning
- added tests for:
  - recurrent op parity vs reference
  - `.out` variant behavior
  - chunked-state parity vs full-sequence execution
  - custom-op vs fallback attention parity
  - tiny Qwen3.5 export selecting `llama.recurrent_gated_delta_rule`
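The tightened discovery order can be sketched as (function name and library filenames are assumptions; the env-var name is from this PR):

```python
import os
from pathlib import Path

# Hedged sketch of custom-op library discovery: explicit override wins,
# otherwise search only inside the package itself — no implicit
# repo-local cmake-out* scanning.
def find_custom_ops_aot_lib(package_dir):
    override = os.environ.get("EXECUTORCH_CUSTOM_OPS_AOT_LIB")
    if override:
        return override
    for name in ("libcustom_ops_aot_lib.so", "libcustom_ops_aot_lib.dylib"):
        candidate = Path(package_dir) / name
        if candidate.is_file():
            return str(candidate)
    return None
```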

## Validation

### Linux CPU-only (aarch64)

Built `custom_ops_aot_lib` successfully and loaded it via
`EXECUTORCH_CUSTOM_OPS_AOT_LIB`.

Passed:
- `pytest
extension/llm/custom_ops/test_update_cache.py::RecurrentGatedDeltaRuleTest
-q`
  - `3 passed`
- `pytest examples/models/llama/tests/test_qwen3_5_attention.py -q`
  - `7 passed`
- `pytest
examples/models/llama/tests/test_export_llama_lib.py::ExportLlamaLibTest::test_tiny_qwen35_export_uses_recurrent_gated_delta_rule
-q`
  - `1 passed`

### Real-model CPU validation

On a real `Qwen3.5-0.8B` CPU run, fused recurrence matched the fallback
path on next-token selection with very small logit drift, and improved
eager prefill latency on the tested prompt.

Observed on local CPU validation:
- same next token from fused path vs fallback
- max logit diff on the order of `1e-5`
- eager prefill speedup about `1.6x` on the tested prompt

### Windows note

A local Windows-only FFHT/MSVC workaround was used during development to
keep the local build usable, but that workaround is intentionally
**not** included in this PR.

## Non-goals / separate issues

I did not treat the local `program.fbs` serialization issue as part of
this change.

This branch does not modify `exir/_serialize/*` or `schema/program.fbs`,
and serialization-focused checks passed on both this branch and clean
`main` once the local environment was set up correctly.

A separate end-to-end tiny Qwen3.5 `.pte` export probe hit:
- `RuntimeError: Missing out variants: {'aten::alias'}`

That appears to be a separate pre-existing export issue outside this
change set.

cc @larryliu0820 @mergennachin @cccclai @helunwencser @jackzhxng

---------

Co-authored-by: Digant Desai <digantdesai@meta.com>
Co-authored-by: Nikhil Viswanath Sivakumar <68182521+nil-is-all@users.noreply.github.com>
@jgibson2 jgibson2 merged commit 6e6c2a7 into polycam Apr 24, 2026